{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU",
    "gpuClass": "standard"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Introducing paperai\n",
        "\n",
        "[paperai](https://github.com/neuml/paperai) is a semantic search and workflow application for medical/scientific papers. Applications range from semantic search indexes that find matches for medical/scientific queries to full-fledged reporting applications powered by machine learning.\n",
        "\n",
        "This notebook gives a brief overview of paperai."
      ],
      "metadata": {
        "id": "uSRZj8RWSBVh"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `paperai` and all dependencies. This step also downloads input data to process."
      ],
      "metadata": {
        "id": "0sqt7nEFSmO4"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "i-DF6MYkR7zX"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/paperai scipy==1.10.0\n",
        "\n",
        "# Download NLTK data\n",
        "!python -c \"import nltk; nltk.download('punkt')\"\n",
        "\n",
        "# Download data\n",
        "!mkdir -p paperai\n",
        "!wget -N https://github.com/neuml/paperai/releases/download/v1.10.0/tests.tar.gz\n",
        "!tar -xvzf tests.tar.gz"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Index data\n",
        "\n",
        "First, we'll index a dataset previously created with [paperetl](https://github.com/neuml/paperetl)."
      ],
      "metadata": {
        "id": "Uevu0XV9dynm"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!python -m paperai.index paperai"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "7YTX5coajUap",
        "outputId": "e771f09e-cee3-4e49-c1bc-d6ea526cbdf9"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Building new model\n",
            "Iterated over 34959 total rows\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The index process reads each row from the sections table in an articles database and builds an embeddings index. In this case, 34,959 text sections were indexed."
      ],
      "metadata": {
        "id": "NT5VMndSd7sF"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Query data\n",
        "\n",
        "Next, we'll run a sample query to find matching articles. The command below runs a similarity query for `COVID-19 and hypertension` and returns the top 2 documents with a score of at least 0.75."
      ],
      "metadata": {
        "id": "clBhJfoweGIZ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!python -m paperai.query \"COVID-19 and hypertension\" 2 paperai 0.75"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "CZeSKYJxzn82",
        "outputId": "285e5cd3-81a8-40f5-bbfe-65322ef91c90"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Loading model from paperai\n",
            "\u001b[91mQuery: COVID-\u001b[0m\u001b[1;91m19\u001b[0m\u001b[91m and hypertension\u001b[0m\n",
            "\n",
            "\u001b[36mArticles\u001b[0m\n",
            "\n",
            "Title: Associations with covid-19 hospitalisation amongst 406,793 adults: the UK Biobank prospective cohort study\n",
            "Published: 2020-05-11\n",
            "Publication: None\n",
            "Entry: 2020-05-19\n",
            "Id: 0l10n9n7\n",
            "Reference: \u001b[4;94mhttp://medrxiv.org/cgi/content/short/2020.05.06.20092957v1?\u001b[0m\u001b[4;94mrss\u001b[0m\u001b[4;94m=\u001b[0m\u001b[4;94m1\u001b[0m\n",
            "\u001b[94m - (0.8635): As with the reports from case series, 17 we observed that a history of hypertension was a strong risk factor for severe covid-19.\u001b[0m\n",
            "\u001b[94m - (0.8062): This suggests it is more likely that hypertension per se, in particular hypertension severe enough to require multiple medications, that is the main risk factor for severe covid-19.\u001b[0m\n",
            "\u001b[94m - (0.7860): Additionally, we examine whether antihypertensive medication use is associated with risk of severe covid-19.\u001b[0m\n",
            "\u001b[94m - (0.7718): Nevertheless, questions remain as to why hypertension should be such a strong risk factor for covid-19.\u001b[0m\n",
            "\u001b[94m - (0.7647): Though in univariable analyses all classes of antihypertensive medication were associated with increased risk, in detailed multivariable analyses none of these were independently associated with severe covid-19 infection after adjusting for hypertension history, age, sex and ethnicity; rather it was the number of antihypertensive medications in use that was significantly related, which is probably a surrogate for severity of hypertension.\u001b[0m\n",
            "\n",
            "Title: Management of osteoarthritis during COVID‐19 pandemic\n",
            "Published: 2020-05-21\n",
            "Publication: Clin Pharmacol Ther\n",
            "Entry: 2020-06-11\n",
            "Id: uxfk6k3c\n",
            "Reference: \u001b[4;94mhttps://doi.org/10.1002/cpt.1910\u001b[0m\n",
            "\u001b[94m - (0.7881): Also, arterial hypertension may be associated with increased risk of mortality in hospitalized COVID-19 infected subjects .\u001b[0m\n",
            "\u001b[94m - (0.7609): A recent meta-analysis showed that the most prevalent COVID-19 comorbidities were hypertension, cardiovascular diseases and diabetes mellitus , and their presence increased life threatening complications.\u001b[0m\n",
            "\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Query data with Python\n",
        "\n",
        "The section above queried for matches via a command line program. The section below shows how the same data can be pulled programmatically with Python."
      ],
      "metadata": {
        "id": "W5i6eiK0ebiE"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import pandas as pd\n",
        "\n",
        "from paperai.models import Models\n",
        "from paperai.query import Query\n",
        "\n",
        "from IPython.display import display, HTML\n",
        "\n",
        "# Load model\n",
        "embeddings, db = Models.load(\"paperai\")\n",
        "cur = db.cursor()\n",
        "\n",
        "def search(query, topn, threshold):\n",
        "  # Query for best matches\n",
        "  results = Query.search(embeddings, cur, query, topn, threshold)\n",
        "\n",
        "  # Get results grouped by document\n",
        "  documents = Query.documents(results, topn)\n",
        "\n",
        "  articles = []\n",
        "\n",
        "  # Print each result, sorted by max score descending\n",
        "  for uid in sorted(\n",
        "    documents, key=lambda k: sum(x[0] for x in documents[k]), reverse=True\n",
        "  ):\n",
        "    cur.execute(\n",
        "      \"SELECT Title, Published, Publication, Entry, Id, Reference \"\n",
        "      + \"FROM articles WHERE id = ?\",\n",
        "      [uid],\n",
        "    )\n",
        "    \n",
        "    article = cur.fetchone()\n",
        "\n",
        "    matches = \"\\n\\n\".join([text for _, text in documents[uid]])\n",
        "\n",
        "    article = {\n",
        "      \"Title\": article[0],\n",
        "      \"Published\": Query.date(article[1]),\n",
        "      \"Publication\": article[2],\n",
        "      \"Entry\": article[3],\n",
        "      \"Id\": article[4],\n",
        "      \"Content\": matches,\n",
        "    }\n",
        "\n",
        "    articles.append(article)\n",
        "\n",
        "  df = pd.DataFrame(articles)\n",
        "  display(HTML(df.to_html(index=False).replace(\"\\\\n\",\"<br>\")))\n",
        "\n",
        "search(\"COVID-19 and hypertension\", 2, 0.75)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 373
        },
        "id": "1CZ65OTs0FS4",
        "outputId": "b4de90bb-d767-47fd-83e9-6c43b4296e02"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Loading model from paperai\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th>Title</th>\n",
              "      <th>Published</th>\n",
              "      <th>Publication</th>\n",
              "      <th>Entry</th>\n",
              "      <th>Id</th>\n",
              "      <th>Content</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td>Associations with covid-19 hospitalisation amongst 406,793 adults: the UK Biobank prospective cohort study</td>\n",
              "      <td>2020-05-11</td>\n",
              "      <td>None</td>\n",
              "      <td>2020-05-19</td>\n",
              "      <td>0l10n9n7</td>\n",
              "      <td>As with the reports from case series, 17 we observed that a history of hypertension was a strong risk factor for severe covid-19.<br><br>This suggests it is more likely that hypertension per se, in particular hypertension severe enough to require multiple medications, that is the main risk factor for severe covid-19.<br><br>Additionally, we examine whether antihypertensive medication use is associated with risk of severe covid-19.<br><br>Nevertheless, questions remain as to why hypertension should be such a strong risk factor for covid-19.<br><br>Though in univariable analyses all classes of antihypertensive medication were associated with increased risk, in detailed multivariable analyses none of these were independently associated with severe covid-19 infection after adjusting for hypertension history, age, sex and ethnicity; rather it was the number of antihypertensive medications in use that was significantly related, which is probably a surrogate for severity of hypertension.</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>Management of osteoarthritis during COVID‐19 pandemic</td>\n",
              "      <td>2020-05-21</td>\n",
              "      <td>Clin Pharmacol Ther</td>\n",
              "      <td>2020-06-11</td>\n",
              "      <td>uxfk6k3c</td>\n",
              "      <td>Also, arterial hypertension may be associated with increased risk of mortality in hospitalized COVID-19 infected subjects (38) .<br><br>A recent meta-analysis showed that the most prevalent COVID-19 comorbidities were hypertension, cardiovascular diseases and diabetes mellitus (21, 32) , and their presence increased life threatening complications.</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Reports\n",
        "\n",
        "The last item we'll cover is running a simple report. Reports run a series of queries combined with a list of extractive QA queries. This combination builds structured outputs designed to bulk query large article datasets."
      ],
      "metadata": {
        "id": "Op9DCik7ezlf"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "%%writefile report.yml\n",
        "name: Report\n",
        "\n",
        "Hypertension:\n",
        "    query: COVID-19 and hypertension\n",
        "    columns:\n",
        "        - name: Date\n",
        "        - name: Study\n",
        "        - {name: Sample Size, query: number of people/patients, query: how many people/patients, type=int}\n",
        "        - {name: Comorbidities, query: covid-19 and hypertension, question: what diseases}"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "DV1tIPk1SO5e",
        "outputId": "56bbb4a0-3933-4d22-efdf-74575bf98c05"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Overwriting report.yml\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!python -m paperai.report report.yml 5 csv paperai"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "NalY7OafUKON",
        "outputId": "f65c3881-0d30-41e3-9529-67002c0fa6e8"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Loading model from paperai\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "display(HTML(pd.read_csv(\"Hypertension.csv\").to_html(index=False)))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "id": "4o04JLEmV3zG",
        "outputId": "44392165-2e5b-454e-8a2b-85ea66f2425b"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ],
            "text/html": [
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th>Date</th>\n",
              "      <th>Study</th>\n",
              "      <th>Sample Size</th>\n",
              "      <th>Comorbidities</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <td>2020-07-13</td>\n",
              "      <td>Disproportionate impact of the COVID-19 pandemic on immigrant communities in the United States</td>\n",
              "      <td>60.8 million cases</td>\n",
              "      <td>obesity, hypertension, and diabetes--comorbidities</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>2020-06-08</td>\n",
              "      <td>Diet Supplementation, Probiotics, and Nutraceuticals in SARS-CoV-2 Infection: A Scoping Review</td>\n",
              "      <td>NaN</td>\n",
              "      <td>systemic inflammation or endothelial damage</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>2020-05-21</td>\n",
              "      <td>Management of osteoarthritis during COVID‐19 pandemic</td>\n",
              "      <td>NaN</td>\n",
              "      <td>hypertension, cardiovascular diseases and diabetes mellitus</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>2020-05-11</td>\n",
              "      <td>Associations with covid-19 hospitalisation amongst 406,793 adults: the UK Biobank prospective cohort study</td>\n",
              "      <td>406,793</td>\n",
              "      <td>1, 2 or 3+ antihypertensive medications</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <td>2020</td>\n",
              "      <td>COVID-19, Renin-angiotensin System and Hematopoiesis</td>\n",
              "      <td>NaN</td>\n",
              "      <td>Renin-angiotensin System and Hematopoiesis</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>"
            ]
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "In this case, we built a report that queries for `COVID-19 and hypertension`. Then it builds a CSV with metadata and extractions for the sample size and comorbidities.\n",
        "\n",
        "Historical reports built for the CORD-19 Kaggle Challenge are available [here](https://github.com/neuml/cord19q/tree/master/tasks). There is also a [Streamlit example application](https://github.com/neuml/paperai/blob/master/examples/search.py) available."
      ],
      "metadata": {
        "id": "ltn9FnDLfE6V"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook gave a brief overview of paperai. Applications range from semantic search to more complex reports. More notebooks will be released in the future covering additional aspects of paperai. \n"
      ],
      "metadata": {
        "id": "SiAVyvKihD_r"
      }
    }
  ]
}